Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Dec 02 14:49:54 2020"

The text continues here.

Exercise 1 - getting familiar with RStudio, RMarkdown, GitHub

by SusiCS

How I am feeling right now?

Well, I would say excited but also confused :) I am just at the beginning of learning to use R and then also using GitHub makes it more complicated but it is also a great challenge. I think it will help me a lot to understand my the work of others using R. Besides, it is always good to learn something new!!

What I am expecting to learn?

  • I conducted a clinical research, with made a lot of fun and is very interesting, although I had to handle a lot of stool samples.
  • However, omics data will be analyzed by another PhD student using R. Therefore, I should be able to know how he worked with R and because I am in a big EU project it would be useful to work with open research tools, so that we can better communicate.
  • And I need a better understanding in statistics :) ... ... ... ...
  • AND it makes a nice impression using fancy graphs XP

Where I heard from this course?

Through our UEF Yammer --> I am just a bit late, saw the add this Thursday morning.

Chapter 2: Regression and Model variation

Describe the work you have done this week and summarize your learning.

date()
## [1] "Wed Dec 02 14:49:54 2020"

Here we go again...

Data wrangling

Integrating data from the web into R

Data frame is learning2014 with 183 observations and 60 variables. The description of the variables are available here

how to do that , see create_learning2014.R in my github

Data analysis

1. reading the data file from another source

learning2014<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep=",", header=T)

dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data file includes 166 observations and 7 variables. The variables are gender, age, attitude (global attitude towards statistic, includes 10 questions, displayed is the mean of 10 questions), deep (deep questions includes 12 questions, displayed is the mean of the 12 questions), stra (strategic questions includes 8 questions, displayed is the mean of 8 questions), surf (surface questions, include 12 questions, displayed is the mean of the 12 questions) and points (exam points). Attitude, deep, stra, surface questions were answered view Lieke scale.

2. Graphical overview

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# alpha=0.5 lightens the colour, bins for the visual scale display
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.1), lower = list(combo = wrap("facethist", bins = 20)))
p

This graph shows an overview of data distribution and correlation. First of all, in this data set more women participated than men. Most of the subjects were younger than 30. The exam points were in average 25 points, The attitude questions of the female were answered in average with lower points (average ~2.9) than the male (average ~3.3).The deep question results are quite the same between male and female (m:3.7;f:3.6). Also for strategic (f:3.6;m:3.5) and surface questions it is very similar (f:2.8;m:2.4). Also the correlation in total (black) and the correlation of male and female are displayed. Significant negative correlation is between surface questions and attitude questions.The overall neg. correlation is -0.176 with a p-Value < 0.05. for the female there is no significant correlation but for the male it is -0.374 with a p-value < 0.01. In an easier way, if the male answered the attitude questions with higher points the surface questions were answered with lower points and other way round. A very high significant negative correlation with a p-value of 0.001 is between deep questions and surface question (-0.324) Here is also only a significance for the male recognized (-0.622). Also here, if the male subjects answered the question of deep with high points the surface questions were answered with lower points and verce vice. An example for a positve significant correlation is between attitude and points (0.437, p-value<0.001). The older the subject the higher rated the attitude questions. This was for the female subjects (0.422, p-valuez0.001) as wel as for the male participants (0.451, p-value<0.001). Another negative correlation is between the strategic questions and the surface question (p<0.05). However only in overall (-0.161).

3. fitting a regression model

my_model_p <- lm(points ~ attitude + stra + surf, data = learning2014)
summary(my_model_p)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

This a regression model with points as dependent variable and the explanatory variables are attitude, strategic, surface. This was chosen due to highest correlations (see graph above). The first results are a summary of the residuals.The residuals are the distance from the data to the fitted line we created. Ideally the data points should be symmetrical distributed around the line (this would be the case if the min and max value has approximately hte same distance to 0 and also the first and third quartile has the same distance to 0, ideally median should be close to 0). The median in our model is near 0 and also the 1Q and 3Q are quite equal. However, the min and max values are not similar. An indicator that the model might not be the best. Next, the Coefficients are presented. For this model (linear regression) the formula is: **points = alpha (=intercept) + battitude + b1stra + b2*surf + e** The estimated values (first column) are the slope of the formula. This means for example, if attitude questions would be increased by 1 point (Lieke scale 1-5), the number of points in the exam would increase by ~3.4.For 1 "unit" strategic questions exam points increase by 0.85 and for the surface questions the exam points would decrease by -0.6. The standard error of estimate and t-value (column 2+3) are provided to show how the p-value were calculated. The pr values is the p-value for estimated parameters. We can see here that only attitude has a significance in this model. This means it seems that only the attitude questions have an influence on exam points. e is a random error component. The R^2 is 0.2074. So attitude questions, surface questions and strategic questions can explain ~21% the exam points. This is not very high. The adjustes R^2 is always smaller. It takes the vumers of variables into acount (or better the degree of freedom). This values is 0.1927. The significance for this modrl is very high (3.15e-08). This means that there is a relation between the exam points and the explinatory variables. The questions that rise up are: Are the variables are correlated to each other?(Can not be answered yet-> don't know how to do it in R :) ) And the non significant variables in this model, do we need them for a good model? The new model without the significant variables are shown in 4..

my_model_p1<-lm(points ~ age + attitude + stra , data=learning2014)
summary(my_model_p1)
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

This model this better to fit. The adjusted R^2 is higher than in the previews model. For age and str variables the p value for the estimated slope values is 0.1 not very low and actually not significant but still better than in the first model.

4. new fitted model

my_model_p2<-lm(points ~ attitude + stra , data=learning2014)
summary(my_model_p2)

This model takes the explanatory variables attitude and stra into account to explain the exam points. The results are quite similar. The adjusted R^2 is a little bit higher with R^2= 0.1951. This means the model could be a bit better than my_model_p. The surface questions have probably no influence or very very low influence on exam points. The strategic questions would now take less influence (0.9137) on the exam as well as the attitude (3.467). However, the strategic estimated value has a lower p-value (0.0621) than before. The overall p-value of 7.734e-09 is also better compared to the first model my_model_p.

my_model_p3<- lm(points ~ attitude, data = learning2014)
summary(my_model_p3)

This is now the model with only one explanatory variable attitude. The residual summary is worse compared to the other models. The estimated value for attitude is now 3.5255. The exam points increase (improve) about 3.5 points if the attitude questions in mean increase about 1 point. There is a relation between both variables because the the attitude is high significant (p-value <0.001). The R^2 is lower compared to the other models. It says that 19% of exam points can be explained by the attitude. however, we should consider the adjusted R^2. It is scaled by the number of parameters in the model. If we add more and more variables, the R^2 would also increase. But it doesn't say if these variable contribute to the dependent variable (If they have a relation to that variable). The adjusted R^2 is calculated by taking the degrees of freedom (which means also the explanatory variables) into account. The power of the model is then reduced.

As a conclusion, model_p1 (age, attitude, stra) is the best fitted model if we look at the adjusted R^2. The second best would be model_p3 (only attitude variable), Therefore these both models will be graphical displayed in task 5.

5. Diagnostic plots

my_model_gra <- lm(points ~ attitude + stra, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra, which = c(1,2,5),main = "Graphical model validation")

my_model_gra_alt <- lm(points ~ attitude + stra + age, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra_alt, which = c(1,2,5),main = "Graphical model validation 2")

my_model_gra_alt1 <- lm(points ~ attitude, data = learning2014)
par(mfrow = c(2,2))
plot(my_model_gra_alt1, which = c(1,2,5),main = "Graphical model validation 3")

The assumptions of the linear regression:

  1. Normal distribution FOr any fixed value of x. Y is normally distributed
  2. Equal Error Variance/Homoscedasticity Constant Error Variance is the same as
  3. Linearity The intercept, slope and error should be linear and additive. The relationship between X and the mean of Y is linear.
  4. Independent Observations (Independent Error Term). All observations are independent.

The "Residuals vs Fitted" graph shows the residuals from the data and if they have non-linear patterns. This means they are random which means that there is a Equal Error Variance. So this assumption is fulfilled. This graph is also used for Independent Observations. We can see no correlation in this graph, so this assumption is also fine.

Normal Distribution is checked from a Normal Q-Q plot. This shows us if the errors/residuals are normally distributed. This is not the case in our data set. We have at the beginning and at the end some outliers that don't fit to the line. This assumption is not fully fulfilled. We have for our data a little negative skew. However, we should think about if these outlier are influential to this model. This will be checked with the last plot.

The Residuals vs Leverage shows any influential cases. Outlines don't have to be always influential in a regression model. Cases outside the Cook's distance line are outliers that would influence the regression model and should be eventually excluded. Here we cannot even see the Cook's distance line in the graph. Which means we have no influential outliers.

Taken all together, the model is ok, but it doesn't explain quite well the relation for the exam points (20% can be explain by attitude and strategic questions). Probably, better models should be applied. For me the best model is p1 with explanatory variables: attitude, age and strategic.

Feeling of the week

Feeling of the week


Chapter 3: Logistic regression

Data Analysis

1./2. reading the data

alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")

To look at the dimension and structure of the data. FOr a description of the variables see here

dim(alc)
## [1] 370  51
str(alc)
## 'data.frame':    370 obs. of  51 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  15 15 15 15 15 15 15 15 15 15 ...
##  $ address   : chr  "R" "R" "R" "R" ...
##  $ famsize   : chr  "GT3" "GT3" "GT3" "GT3" ...
##  $ Pstatus   : chr  "T" "T" "T" "T" ...
##  $ Medu      : int  1 1 2 2 3 3 3 2 3 3 ...
##  $ Fedu      : int  1 1 2 4 3 4 4 2 1 3 ...
##  $ Mjob      : chr  "at_home" "other" "at_home" "services" ...
##  $ Fjob      : chr  "other" "other" "other" "health" ...
##  $ reason    : chr  "home" "reputation" "reputation" "course" ...
##  $ guardian  : chr  "mother" "mother" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 2 1 2 2 2 1 ...
##  $ studytime : int  4 2 1 3 3 3 3 2 4 4 ...
##  $ schoolsup : chr  "yes" "yes" "yes" "yes" ...
##  $ famsup    : chr  "yes" "yes" "yes" "yes" ...
##  $ activities: chr  "yes" "no" "yes" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "yes" "yes" "no" "yes" ...
##  $ romantic  : chr  "no" "yes" "no" "no" ...
##  $ famrel    : int  3 3 4 4 4 4 4 4 4 4 ...
##  $ freetime  : int  1 3 3 3 2 3 2 1 4 3 ...
##  $ goout     : int  2 4 1 2 1 2 2 3 2 3 ...
##  $ Dalc      : int  1 2 1 1 2 1 2 1 2 1 ...
##  $ Walc      : int  1 4 1 1 3 1 2 3 3 1 ...
##  $ health    : int  1 5 2 5 3 5 5 4 3 4 ...
##  $ n         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ id.p      : int  1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
##  $ id.m      : int  2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
##  $ failures  : int  0 1 0 0 1 0 1 0 0 0 ...
##  $ paid      : chr  "yes" "no" "no" "no" ...
##  $ absences  : int  3 2 8 2 5 2 0 1 9 10 ...
##  $ G1        : int  10 10 14 10 12 12 11 10 16 10 ...
##  $ G2        : int  12 8 13 10 12 12 6 10 16 10 ...
##  $ G3        : int  12 8 12 9 12 12 6 10 16 10 ...
##  $ failures.p: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ paid.p    : chr  "yes" "no" "no" "no" ...
##  $ absences.p: int  4 2 8 2 2 2 0 0 6 10 ...
##  $ G1.p      : int  13 13 14 10 13 11 10 11 15 10 ...
##  $ G2.p      : int  13 11 13 11 13 12 11 10 15 10 ...
##  $ G3.p      : int  13 11 12 10 13 12 12 11 15 10 ...
##  $ failures.m: int  1 2 0 0 2 0 2 0 0 0 ...
##  $ paid.m    : chr  "yes" "no" "yes" "yes" ...
##  $ absences.m: int  2 2 8 2 8 2 0 2 12 10 ...
##  $ G1.m      : int  7 8 14 10 10 12 12 8 16 10 ...
##  $ G2.m      : int  10 6 13 9 10 12 0 9 16 11 ...
##  $ G3.m      : int  10 5 13 8 10 11 0 8 16 11 ...
##  $ alc_use   : num  1 3 1 1 2.5 1 2 2 2.5 1 ...
##  $ high_use  : logi  FALSE TRUE FALSE FALSE TRUE FALSE ...
##  $ cid       : int  3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...

This matched data set contains now 370 obervations and 51 variables. The data are student achievement in secondary education of 2 schools. It is shown from which school the pupils are, their age, sex, if they live in urban or rural area, etc and it cointain also their alcohol consumption and weather it is high or not.

3. hypothesis towards alcohol

The use of alcohol can have certain numbers of reason. For my analysis I will use the gender, the class failure, quality of family relationship, going out with friends. The gender is interesting because male and female handle the alcohol consumption in a different way. normally females drink more often with friends, whereas males also drink more often alone. ALso the drinks are usually different. Males drink more beer and strong liquids, females more sparkling wine, wine and longdrinks.SO male drink more than female (my hypothesis :) ) Class failure could be also interesting. The more failure the more the tendency to drink. IS the family a supportive and cosy environment (quality of fam relationship), the alcohol consumption should be less (less wrong friends, stronger resilience) And of course, if you go out with friends the tendency to drink will be probably higher. Even when you have not a good family life.

4. testing the hypothesis

barplot

library(tidyr); library(dplyr); library(ggplot2)
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Unfortunately, you cannot see well the graphs. However, you it is shown that most of the students have no failures. Just a few have 1 or 2 past class failures. Most of the students have good family relationships (4) or even very good relationships (5).The variable goout looks quite normal distributed. most of them go moderate out (3). In this data file a slight higher number of female students were asked.

crosstable

alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex   high_use count
##   <chr> <lgl>    <int>
## 1 F     FALSE      154
## 2 F     TRUE        41
## 3 M     FALSE      105
## 4 M     TRUE        70
alc %>% group_by(failures, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'failures' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups:   failures [4]
##   failures high_use count
##      <int> <lgl>    <int>
## 1        0 FALSE      238
## 2        0 TRUE        87
## 3        1 FALSE       12
## 4        1 TRUE        12
## 5        2 FALSE        8
## 6        2 TRUE         9
## 7        3 FALSE        1
## 8        3 TRUE         3
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   famrel [5]
##    famrel high_use count
##     <int> <lgl>    <int>
##  1      1 FALSE        6
##  2      1 TRUE         2
##  3      2 FALSE        9
##  4      2 TRUE         9
##  5      3 FALSE       39
##  6      3 TRUE        25
##  7      4 FALSE      128
##  8      4 TRUE        52
##  9      5 FALSE       77
## 10      5 TRUE        23
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups:   goout [5]
##    goout high_use count
##    <int> <lgl>    <int>
##  1     1 FALSE       19
##  2     1 TRUE         3
##  3     2 FALSE       82
##  4     2 TRUE        15
##  5     3 FALSE       97
##  6     3 TRUE        23
##  7     4 FALSE       40
##  8     4 TRUE        38
##  9     5 FALSE       21
## 10     5 TRUE        32

If we look at the crosstables, which shows the relationship between high use of alcohol and the other variables, we see that more female student drink less (154 females to 105 males) alcohol and more male students drink more alcohol(70 males to 41 females). --> Hypotheses is fulfilled. For the failures, we can see that no fails of students have the highest number of low alcohol use (238) but also for high alcohol use (87). This means the hypothesis is not fulfilled. But if we look at the relation to the students in each category, the alcohol use is higher - hypothesis fulfilled. For the family relation is the higher the quality of th relationships is the higher is the number of low alcohol drinker but also the more high use of alcohol is recognized. The highest number of high use of alcohol is at good relationships with 52 students (grade 4).But it has also the highst number no low alcohol drinker (128) It seems that also here that the (hypothesis is not fulfilled). However if we look at the realtions of false and true of each grade, it might looks different. Is the relationship not good, 1/4-1/2 drink more alcohol. This is getting lower the higher the quality of relationships are.Hypothesis fulfilled if we look at each category sepratly. For the times, how often students go out and how much they drink, we can see the more student go out the more they drink. Hypothesis fulfilled.

boxplot

fail<- ggplot(alc, aes(x = sex, y = alc_use, col=failures>0))+ geom_boxplot() + ggtitle("Relation of failures and alcohol consumption and sex")
fail

fail1<-ggplot(alc, aes(x = high_use, y = failures, col = sex))+ geom_boxplot() + ggtitle("Relation of failures, alcohol consumption and sex")
fail1

fam<- ggplot(alc, aes(x = high_use, y = famrel))+ geom_boxplot() + ggtitle("Relation of quality in family relationship and alcohol consumption")
fam

fam1<-ggplot(alc, aes(x = high_use, y = famrel, col = sex))+ geom_boxplot() + ggtitle("Relation of quality in family relationship, alcohol consumption and sex")
fam1

go<- ggplot(alc, aes(x = high_use, y = goout))+ geom_boxplot() + ggtitle("Relation of going out with friends and alcohol consumption")
go

go1<-ggplot(alc, aes(x = high_use, y = goout, col = sex))+ geom_boxplot() + ggtitle("Relation of going out with friends, alcohol consumption and sex")
go1

Here are the boxplots. They show an overall view of the data. For the failure, the normal boxplot seems not to be suitable (Don't know why...). No statement can be made. Therefore I used the alc_use, and here you can see that the more falures the more alcohol is used usually. For the family relation we see that the higher the family relationships were ranked the less alcohol were drunken. The hypothesis can be confirmed. It seems also that more male studens have a better relationship with the family. ALso the more the students go out, the more the alcohol they drink. Also more male students go out and drink also more. Hypothesis is fulfilled.

5. logistic regression

model + odd ratio and confidence intervalls (CI)

m <- glm(high_use ~ goout + famrel + sex+ failures, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + sex + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7648  -0.7602  -0.5209   0.8660   2.4221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3222     0.6487  -3.580 0.000344 ***
## goout         0.7663     0.1230   6.230 4.65e-10 ***
## famrel       -0.4178     0.1407  -2.969 0.002991 ** 
## sexM          0.9495     0.2583   3.676 0.000237 ***
## failures      0.4754     0.2237   2.126 0.033544 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 374.44  on 365  degrees of freedom
## AIC: 384.44
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)       goout      famrel        sexM    failures 
##  -2.3222263   0.7663065  -0.4178083   0.9495276   0.4754263
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp 
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.09805504 0.02652005 0.3401879
## goout       2.15180388 1.70202235 2.7597607
## famrel      0.65848847 0.49780462 0.8659968
## sexM        2.58448845 1.56615523 4.3203098
## failures    1.60869986 1.04482774 2.5276864

THe summery of the model shows first a summary of residuals. This looks fine since they are close to being centered on = and roughly symmetrical. The coefficients are shown afterwards, including the standard error, z value and the p value of z. All chosen variables show significance which means they can explain the high_use variable. The highest significance is for male students and goout. We can assume that for one more unit (answering the questions by 1 point more) the high_use alc increase by 0.7663 (for goout), 0.47 (for failures) and about 0.95 if it is a male student. It decreases around -0.4 if the fam question is increased around 1 point. The AIC (Akaike Information Criterion) is 384.44. Can be used to compare the model to other models. (It is just the residual deviance adjusted for the number of parameters in the model)

After the model, you can find the coefficients showed a odd ratios. Odd ratios are not the probabilities but the show the ratio of change that something happens to the change that something not happens. It quantifies the association between 2 events. Is the odd ratio greater than 1 we have a positive association and if the odd ratio is smaller than 1 its a negative association. For our model we can say that failure, goout and male have a positive association to higher use of alcohol, whereas the quality of family relationship is negatively associated to high_alc use. Compared to the former hypothesis, these results agree to the hyposesis.

Because failure is not so high significant as the other, here are the logistic regression model without it:

m1 <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ goout + famrel + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5998  -0.7657  -0.5342   0.8074   2.5563  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2863     0.6453  -3.543 0.000396 ***
## goout         0.7979     0.1226   6.506 7.71e-11 ***
## famrel       -0.4350     0.1400  -3.107 0.001892 ** 
## sexM          0.9906     0.2561   3.867 0.000110 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 379.11  on 366  degrees of freedom
## AIC: 387.11
## 
## Number of Fisher Scoring iterations: 4

Here you can see that the AIC is slightly higher, so the model seems to fit better. Therefore I will use these variables for the predictive model.

6. predictive model

m <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")

probabilities <- predict(m, type = "response")# predict() the probability of high_use

alc <- mutate(alc, probability = probabilities)# add the predicted probabilities to 'alc'

alc <- mutate(alc, prediction = probability > 0.5)# use the probabilities to make a prediction of high_use

select(alc, goout, famrel, sex, high_use, probability, prediction) %>% tail(10)
##     goout famrel sex high_use probability prediction
## 361     3      5   M     TRUE  0.25406225      FALSE
## 362     3      4   M     TRUE  0.34478448      FALSE
## 363     1      4   M     TRUE  0.09640288      FALSE
## 364     4      3   M     TRUE  0.64356587       TRUE
## 365     2      3   M    FALSE  0.26797363      FALSE
## 366     3      4   M     TRUE  0.34478448      FALSE
## 367     2      4   M    FALSE  0.19155369      FALSE
## 368     4      4   M     TRUE  0.53888551       TRUE
## 369     4      5   M     TRUE  0.43065938      FALSE
## 370     2      4   M    FALSE  0.19155369      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)# creating a confusion matrix with actual values
##         prediction
## high_use FALSE TRUE
##    FALSE   241   18
##    TRUE     62   49
(241+49)/(241+49+18+62)#accuracy of the model
## [1] 0.7837838
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins  # creating a confusion matrix with predicted values
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65135135 0.04864865 0.70000000
##    TRUE  0.16756757 0.13243243 0.30000000
##    Sum   0.81891892 0.18108108 1.00000000

The first table (confusion matrix with actual values) shows that 241 students who actually don't drink much alcohol and this was also predicted by the model. 18 cases were wrong predicted, so that these 18 students drink actually less alcohol but the model predict is as high alcohol consume. For high use, the model predicted 49 students as right but 62 students were considers as less alcohol use although, which is wrong. As a result we can say the model is accurate for 78,4 percent. The second table shows the predicted values, so the probabilities.

visual probability

g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ geom_point() + ggtitle("logistic regression model with the variable famrel, sex and goout")
g 

This is the visualization of the model.

training error

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability) # compute the average number of wrong predictions in the (training) data
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc)) # K-fold cross-validation, cv=training data

cv$delta[1]# average number of wrong predictions in the cross validation
## [1] 0.2216216

To see how good the model is, a test will be conducted. For that new test variables will be used. The wrong predicted value number is 22. Which is quite ok.

Bonus question

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$probability) # compute the average number of wrong predictions in the (training) data
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10) # K-fold cross-validation, cv=training data

cv$delta[1]# average number of wrong predictions in the cross validation
## [1] 0.2324324

My model has a better test set perfomance, because it has 24 than 26 in the Data camp.

Well, this week...

Well, this week...


Clustering and Classification

2. Exploring the data "Bosten"

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston") #loading the data
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

This data set comprises 506 observations and 14 variables. This data set describes the crime rate by town (crim). Further varaibles are pupil teacher ratio by town (ptratio), lower status of the population in percent (lstat), average number of rooms per dwelling (rm), full-value property-tax rate per 10000$ (tax). All the other variables descriptions can be seen here.

3. visualization

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## v purrr   0.3.4
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'readr' was built under R version 4.0.3
## Warning: package 'purrr' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
## Warning: package 'forcats' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(corrplot)# to draw all corplots
## Warning: package 'corrplot' was built under R version 4.0.3
## corrplot 0.84 loaded
cor_matrix<-cor(Boston)%>% round(2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58    0.29
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31   -0.39
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72    0.38
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04   -0.12
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67    0.19
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29   -0.36
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51    0.26
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53   -0.23
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91    0.46
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00    0.46
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46    1.00
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44   -0.18
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54    0.37
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47   -0.51
##         black lstat  medv
## crim    -0.39  0.46 -0.39
## zn       0.18 -0.41  0.36
## indus   -0.36  0.60 -0.48
## chas     0.05 -0.05  0.18
## nox     -0.38  0.59 -0.43
## rm       0.13 -0.61  0.70
## age     -0.27  0.60 -0.38
## dis      0.29 -0.50  0.25
## rad     -0.44  0.49 -0.38
## tax     -0.44  0.54 -0.47
## ptratio -0.18  0.37 -0.51
## black    1.00 -0.37  0.33
## lstat   -0.37  1.00 -0.74
## medv     0.33 -0.74  1.00
corrplot.mixed(cor_matrix, lower.col="black", number.cex = .8 )

cor_matrix1<-cor(Boston)%>% round()
cor_matrix1
##         crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
## crim       1  0     0    0   0  0   0   0   1   1       0     0     0    0
## zn         0  1    -1    0  -1  0  -1   1   0   0       0     0     0    0
## indus      0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## chas       0  0     0    1   0  0   0   0   0   0       0     0     0    0
## nox        0 -1     1    0   1  0   1  -1   1   1       0     0     1    0
## rm         0  0     0    0   0  1   0   0   0   0       0     0    -1    1
## age        0 -1     1    0   1  0   1  -1   0   1       0     0     1    0
## dis        0  1    -1    0  -1  0  -1   1   0  -1       0     0     0    0
## rad        1  0     1    0   1  0   0   0   1   1       0     0     0    0
## tax        1  0     1    0   1  0   1  -1   1   1       0     0     1    0
## ptratio    0  0     0    0   0  0   0   0   0   0       1     0     0   -1
## black      0  0     0    0   0  0   0   0   0   0       0     1     0    0
## lstat      0  0     1    0   1 -1   1   0   0   1       0     0     1   -1
## medv       0  0     0    0   0  1   0   0   0   0      -1     0    -1    1
corrplot.mixed(cor_matrix1, lower.col="black", number.cex = .8 )

The graph and the new matrix shows the correlation of the variables. As you can see, there are some correlations (best seen in cor_matrix1, without decimals). The second plot shows only the strongest correlations The red dots show a negative correlation, for example the nitrogen oxidation is positive correlated with the proportion of non-retail business acres per town and a negative correlation with the proportion of residential land zoned for lots over 25,000sq.ft.
Also it seems that there exist more postiv correlations compared to negative ones.

4. scaled data

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
boston_scaled<-as.data.frame(boston_scaled)

The data are now scaled. The data were scaled by the following formula: scaled(x)=x−mean(x)/sd(x) Actually, the data are not changed, but the scale itself. Now we can compare the variables because we have no variables that have the same scale.

summary(boston_scaled$crim) #summary of crim
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
bins<-quantile(boston_scaled$crim) #changing in quantiles
crime<-cut(boston_scaled$crim, breaks=bins, include.lowest = T, labels = c("low", "med_low", "med_high", "high")) #labeling
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim) # droping crim rate
boston_scaled<-data.frame(boston_scaled, crime)

#splitting in train and test set
n<-nrow(boston_scaled)
ind<-sample(n, size = n*0.8)
train<-boston_scaled[ind,]
test<-boston_scaled[-ind,]
correct_class <- test$crime
test<- dplyr::select(test, -crime)

5. linear discriminant analysis (LDA)

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2599010 0.2450495 0.2376238 
## 
## Group means:
##                   zn      indus         chas        nox         rm        age
## low       0.94502108 -0.8968267 -0.120902136 -0.8784534  0.4581364 -0.8929141
## med_low  -0.09440395 -0.2825507 -0.009855719 -0.5637292 -0.1043257 -0.3316601
## med_high -0.39109146  0.1509168  0.045820446  0.3439877  0.1277359  0.3854632
## high     -0.48724019  1.0172418 -0.108283225  1.0451282 -0.3787758  0.8481358
##                 dis        rad        tax    ptratio      black       lstat
## low       0.8420440 -0.6837115 -0.7135306 -0.4604641  0.3762914 -0.76724288
## med_low   0.3310288 -0.5574853 -0.4747238 -0.0626037  0.3187491 -0.16320321
## med_high -0.3554914 -0.3855964 -0.2958002 -0.1474263  0.1205950 -0.02972319
## high     -0.8563115  1.6368728  1.5131579  0.7793151 -0.7426174  0.92861709
##                   medv
## low       0.5539872380
## med_low  -0.0004604597
## med_high  0.1488744571
## high     -0.7364743576
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.11342853  0.714697641 -0.89398336
## indus   -0.02176815 -0.166091136  0.37621847
## chas    -0.08229289 -0.062296925  0.17611035
## nox      0.45431317 -0.902512359 -1.48665384
## rm      -0.06181006 -0.062034049 -0.11368681
## age      0.26538566 -0.264387242 -0.10904244
## dis     -0.05804669 -0.282649540  0.12960368
## rad      2.98898979  1.014386823 -0.08847580
## tax     -0.03887459 -0.069074601  0.53791633
## ptratio  0.13968815 -0.085524044 -0.33705316
## black   -0.12159935 -0.008102415  0.06863129
## lstat    0.22861805 -0.105222024  0.41900261
## medv     0.15270178 -0.375686399 -0.26723590
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9466 0.0389 0.0145
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Here you can see the LDA model

6. testing the model on test variables

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_class, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       8        1    0
##   med_low    1      17        3    0
##   med_high   0       7       20    0
##   high       0       0        0   31

As you can see, the model can predict at its best the high quantile of the crimes with only 1 miss predicted value. The model has its most problems in deciding if the crimes are in the category med_low or med_high.

7. k-means algorithm

library(MASS)
data("Boston")
bos_scale<-scale(Boston)
summary(bos_scale)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
dis_boston<-dist(bos_scale) #eucline distance
summary(dis_boston)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
km <-kmeans(Boston, centers = 4)# k-means clustering
pairs(Boston[1:5], col = km$cluster)# plot the Boston dataset with clusters

pairs(Boston[6:10], col = km$cluster)

km <-kmeans(Boston, centers = 3)# less clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

km <-kmeans(Boston, centers = 5)# more clusters
pairs(Boston[1:5], col = km$cluster)

pairs(Boston[6:10], col = km$cluster)

Alone with visualization it is not possible to determine the clusters. Therefore we need to determine k for k-means. Fortunately R can show graphically how many clusters are exist.

set.seed(123)
k_max <- 5# determine the number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})# calculate the total within sum of squares
qplot(x = 1:k_max, y = twcss, geom = 'line')# visualize the results

km <-kmeans(Boston, centers = 2)# k-means clustering
pairs(Boston[1:5], col = km$cluster)# plot the Boston dataset with clusters

pairs(Boston[6:10], col = km$cluster)

The number of clusters is 2. The total cluster sum of squares (WCSS) has changed rapidly between 1 and 2 clusters, which means the cluster is 2. We can see quite often vertical clusters. So is for example between crim and zn. One cluster is vertical around low residential land but high er crime rate and the other cluster is spreaded horizontal around low crime rate. Also is one cluster vertical around high nitrogen rate but low crime rate and one cluster between 0.6 and 0.8 nitrogene rate and rising crime rate. Also, these clusters show us in variable taxes, that mostly the taxes were around 200 and 500. These clustering method is a good visualization to find objects which are more similar to some than to others.

Whoop Whoop, we are getting closer

Whoop Whoop, we are getting closer


Dimensionality reduction techniques

1. explore the data set

human<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep=",", header=TRUE)
library(MASS)
library(tidyverse)
library(ggplot2)
library(corrplot)
library(GGally)
pairs(human)

ggpairs(human, lower = list(combo = wrap("facethist", bins = 20)))

human_cor<- cor(human)%>%round(2)
human_cor
##           Edu2.FM Labo.FM Edu.Exp Life.Exp   GNI Mat.Mor Ado.Birth Parli.F
## Edu2.FM      1.00    0.01    0.59     0.58  0.43   -0.66     -0.53    0.08
## Labo.FM      0.01    1.00    0.05    -0.14 -0.02    0.24      0.12    0.25
## Edu.Exp      0.59    0.05    1.00     0.79  0.62   -0.74     -0.70    0.21
## Life.Exp     0.58   -0.14    0.79     1.00  0.63   -0.86     -0.73    0.17
## GNI          0.43   -0.02    0.62     0.63  1.00   -0.50     -0.56    0.09
## Mat.Mor     -0.66    0.24   -0.74    -0.86 -0.50    1.00      0.76   -0.09
## Ado.Birth   -0.53    0.12   -0.70    -0.73 -0.56    0.76      1.00   -0.07
## Parli.F      0.08    0.25    0.21     0.17  0.09   -0.09     -0.07    1.00
corrplot.mixed(human_cor, lower.col="black", number.cex = .8 )

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
dim(human)
## [1] 155   8
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

This data set contains 8 variables and 155 observations. The variables are: Population.with.Secondary.Education rate (Edu2FM), Labour.Force.Participation.Rate.(Labo.FM), Expected.Years.of.Education (Edu.Exp), Life.Expectancy.at.Birth (Life.Exp), Gross.National.Income..GNI..per.Capita (GNI), Maternal.Mortality.Ratio (MAT.Mor.), Adolescent.Birth.Rate (Ado.Birth), Percent.Representation.in.Parliament (Parti.F). The ggplot and the corrplot.mixed shows the correlations between the variables. For example, the the life expectation is strongly correlated with the expected education. That means that the higher the the education the longer you life. Also the lower the education level the higher the maternal mortality rate.And of course the higher the maternal mortality rate the lower the life expectancy (highest negative correlation).

2. PCA

pca_hum<-prcomp(human) # PCA
s<-summary(pca_hum)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
pc_pr <- round(100*s$importance[2,], digits = 1)
pc_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc_lab <- paste0(names(pc_pr), " (", pc_pr, "%)")
biplot(pca_hum, cex = c(0.8, 1.5), col = c("lightslateblue", "cyan1"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Well, this seems quite strange. The scales are not the same. Here only one dimension explains all variances. Propably it is the GNI. But to interprete anything is quite difficult, even not possible I would say. Therefore, we need first to scale the data to have the same scale for all data.

3. standardized PCA

human_std<-scale(human)
pca_human<-prcomp(human_std) # PCA
s1<-summary(pca_human)
s1
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
pc_pr1 <- round(100*s1$importance[2,], digits = 1)
pc_pr1
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
pc_lab1 <- paste0(names(pc_pr1), " (", pc_pr1, "%)")
biplot(pca_human, cex = c(0.8, 1.5), col = c("lightslateblue", "cyan1"), xlab = pc_lab1[1], ylab = pc_lab1[2])

As described above, we cannot interpret the results without the same scale. Through scaling data are still the same, but we can now compare it. I would say the results may the same also PCA looked different because of the different scales. But only now we can make some correlations assumptions of these data. Probably, Parli F and Labo FM variables describe the PC2 whereas the other variables descirbe the other component.

4. personal interpretation

The table "importance of components" shows the standard deviation, which is simply the eigenvalue, the proportion of Variance, the amount og the variance of the component accounts for in the data and the Cumulative proportion which is the accumulated amount of explained variance (all 8 compnents give us 100% of the total variance). Already PC1 accounts 53.62% of the total variance in the data. We could ask us, how many components do we actually want if already component 1 has such a huge impact. Interesting is for example that the Nordic countries are all together.

screeplot(pca_human, type = "l", npcs = 15, main = "Screenplot of the 8 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
       col=c("red"), lty=5, cex=0.6)

This graph shows us, that already the first 2 components explain 90% of the variance (they have an Eigenvalue>1). So these 2 components would be enough to explain it.

The standardized PCA shows that Maternal death rate and adolescent birth rate is strongly correlated/associated. Also Labo FM and Parli F. ALso the other 4 variables are correlated to each other (Edu2 FM, EDU Exp Life exp, GNI). But their arrows are around 180 degrees "away" from Mat. Mor. and Ado. Birth, it can be said that they are negatively correlated. All features which are horizontal located have almost no correlation to Labo FM and Parli F. Besides this, we can say that Martial Mort. and Ado birth contribute to PCA1 positivly and Labo FM and Parli F contribute positively to PCA2. Rwanda is a bit outside. that could be an outlier.

5. MCA

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.0.3
data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch") # keep the clomun
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(colour='blue') + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

mca <- MCA(tea_time, graph = FALSE) # MCA for catecorical variable
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"), habillage = "quali")

#MCA plot for all data except age, because this is a continues variable
tea1<- subset(tea, select = -c(age)) # select function to drup 1 column
mca <- MCA(tea1, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea1, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.090   0.082   0.070   0.063   0.056   0.053   0.050
## % of var.              5.838   5.292   4.551   4.057   3.616   3.465   3.272
## Cumulative % of var.   5.838  11.130  15.681  19.738  23.354  26.819  30.091
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
## Variance               0.048   0.047   0.044   0.041   0.040   0.039   0.037
## % of var.              3.090   3.053   2.834   2.643   2.623   2.531   2.388
## Cumulative % of var.  33.181  36.234  39.068  41.711  44.334  46.865  49.252
##                       Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
## Variance               0.036   0.035   0.034   0.032   0.031   0.031   0.030
## % of var.              2.302   2.275   2.172   2.085   2.013   2.011   1.915
## Cumulative % of var.  51.554  53.829  56.000  58.086  60.099  62.110  64.025
##                       Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27  Dim.28
## Variance               0.028   0.027   0.026   0.025   0.025   0.024   0.024
## % of var.              1.847   1.740   1.686   1.638   1.609   1.571   1.524
## Cumulative % of var.  65.872  67.611  69.297  70.935  72.544  74.115  75.639
##                       Dim.29  Dim.30  Dim.31  Dim.32  Dim.33  Dim.34  Dim.35
## Variance               0.023   0.022   0.021   0.020   0.020   0.019   0.019
## % of var.              1.459   1.425   1.378   1.322   1.281   1.241   1.222
## Cumulative % of var.  77.099  78.523  79.901  81.223  82.504  83.745  84.967
##                       Dim.36  Dim.37  Dim.38  Dim.39  Dim.40  Dim.41  Dim.42
## Variance               0.018   0.017   0.017   0.016   0.015   0.015   0.014
## % of var.              1.152   1.092   1.072   1.019   0.993   0.950   0.924
## Cumulative % of var.  86.119  87.211  88.283  89.301  90.294  91.244  92.169
##                       Dim.43  Dim.44  Dim.45  Dim.46  Dim.47  Dim.48  Dim.49
## Variance               0.014   0.013   0.012   0.011   0.011   0.010   0.010
## % of var.              0.891   0.833   0.792   0.729   0.716   0.666   0.660
## Cumulative % of var.  93.060  93.893  94.684  95.414  96.130  96.796  97.456
##                       Dim.50  Dim.51  Dim.52  Dim.53  Dim.54
## Variance               0.009   0.009   0.008   0.007   0.006
## % of var.              0.605   0.584   0.519   0.447   0.390
## Cumulative % of var.  98.060  98.644  99.163  99.610 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1             | -0.580  1.246  0.174 |  0.155  0.098  0.012 |  0.052  0.013
## 2             | -0.376  0.522  0.108 |  0.293  0.350  0.066 | -0.164  0.127
## 3             |  0.083  0.026  0.004 | -0.155  0.099  0.015 |  0.122  0.071
## 4             | -0.569  1.196  0.236 | -0.273  0.304  0.054 | -0.019  0.002
## 5             | -0.145  0.078  0.020 | -0.142  0.083  0.019 |  0.002  0.000
## 6             | -0.676  1.693  0.272 | -0.284  0.330  0.048 | -0.021  0.002
## 7             | -0.191  0.135  0.027 |  0.020  0.002  0.000 |  0.141  0.095
## 8             | -0.043  0.007  0.001 |  0.108  0.047  0.009 | -0.089  0.038
## 9             | -0.027  0.003  0.000 |  0.267  0.291  0.049 |  0.341  0.553
## 10            |  0.205  0.155  0.028 |  0.366  0.546  0.089 |  0.281  0.374
##                 cos2  
## 1              0.001 |
## 2              0.021 |
## 3              0.009 |
## 4              0.000 |
## 5              0.000 |
## 6              0.000 |
## 7              0.015 |
## 8              0.006 |
## 9              0.080 |
## 10             0.052 |
## 
## Categories (the 10 first)
##                  Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2 v.test  
## breakfast     |  0.182  0.504  0.031  3.022 |  0.020  0.007  0.000  0.330 |
## Not.breakfast | -0.168  0.465  0.031 -3.022 | -0.018  0.006  0.000 -0.330 |
## Not.tea time  | -0.556  4.286  0.240 -8.468 |  0.004  0.000  0.000  0.065 |
## tea time      |  0.431  3.322  0.240  8.468 | -0.003  0.000  0.000 -0.065 |
## evening       |  0.276  0.830  0.040  3.452 | -0.409  2.006  0.087 -5.109 |
## Not.evening   | -0.144  0.434  0.040 -3.452 |  0.214  1.049  0.087  5.109 |
## lunch         |  0.601  1.678  0.062  4.306 | -0.408  0.854  0.029 -2.924 |
## Not.lunch     | -0.103  0.288  0.062 -4.306 |  0.070  0.147  0.029  2.924 |
## dinner        | -1.105  2.709  0.092 -5.240 | -0.081  0.016  0.000 -0.386 |
## Not.dinner    |  0.083  0.204  0.092  5.240 |  0.006  0.001  0.000  0.386 |
##                Dim.3    ctr   cos2 v.test  
## breakfast     -0.107  0.225  0.011 -1.784 |
## Not.breakfast  0.099  0.208  0.011  1.784 |
## Not.tea time   0.062  0.069  0.003  0.950 |
## tea time      -0.048  0.054  0.003 -0.950 |
## evening        0.344  1.653  0.062  4.301 |
## Not.evening   -0.180  0.864  0.062 -4.301 |
## lunch          0.240  0.343  0.010  1.719 |
## Not.lunch     -0.041  0.059  0.010 -1.719 |
## dinner         0.796  1.805  0.048  3.777 |
## Not.dinner    -0.060  0.136  0.048 -3.777 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.031 0.000 0.011 |
## tea.time      | 0.240 0.000 0.003 |
## evening       | 0.040 0.087 0.062 |
## lunch         | 0.062 0.029 0.010 |
## dinner        | 0.092 0.000 0.048 |
## always        | 0.056 0.035 0.007 |
## home          | 0.016 0.002 0.030 |
## work          | 0.075 0.020 0.022 |
## tearoom       | 0.321 0.019 0.031 |
## friends       | 0.186 0.061 0.030 |
plot(mca, invisible=c("ind"), habillage = "quali")

The tea data set is about tea. 300 individuals were asked how they drink tea (18 questions), what are their product's perception (12 questions) and some personal details (4 questions). Here we have kept the Tea (which tea), How (without something, with something, if yes what), how (tea bag, unpacked or both), lunch(dringing at lunch yes or no), sugar(yes or no) and where(where to buy it). We can say that in overall, most people take tea bags, drinking earl gray, mostly pure and not during lunch time. The sugar intake is slightly lower than without sugar and more people buy it in a chain store.

The Multiple Correspondence Analysis analysis the pattern of relationships of several categorical variables. First we see the Eigenvalues: It shows the percentage of explained variances of each Dimensions. So has Dimension 1 15.24 % of the explained variance. The individuals shows the coordinates of the first 10 variables, but also the contribution to the construction of the axis(in percentage), squared correlation (cos2), which represents the quality of the representation. The contribution is quite small because we have 300 observations. The first 3 dimensions are displayed. For our out, we can say that the first individual has the coordination of -0,298, with a contribution of 0.106 and squared correlation of 0.086. Next we have the results of the categories. The output shows the coordinates, the contribution nd the cos2 and the v.test for all categories (ok, here they show only the first 10 rows). The v. test follows normal distribution and can be seen as significant from 0 if the values are above 2 (1.96) or below -2 (1.96). For the category black we have a significance in dimension1 (4.677) and dimension3 (-10.692) but not for the second dimension (0.929) The last tables shows the squared correlation of the categorical variable(~R2). If the value is close to 1 it has a strong correlation to that dimension. For example: how has the strongest correlation of all in dimension 1 with a value of 0.708. The weeks and actually absolutely no correlation has lunch with the dimension 1.

The first tea-time factor map shows the kept variables. We can say that between Earl Gray and milk but also sugar there is a similarity. So it seems that people who drink Earl gray (also most nothing in their tea) if they use milk then in Earl gray. An no sugar is more common in black than than in Earl gray. It is also shown that between unpacked tea and tea shop there is a similarity because these categories are closer. So people buys unpacked tea in tea shops. Whereas, teabags are bought in chain stores more often. Green tea is a bit special. It stays alone. But it has more similarities with a chain store than a tea shop but just minor differences (distances).

The second plot shows all categorical variables and their similarities. Here, it shows that Earl Grey has similarities with evenings. So they might drink more in the evenings. lunch and pub are very close. So mybe pople if the drink tea at lunch time than in a pup. Students are in the groups of 15-24 but they are not are otherwise far from the other variables. Maybe there were not so many individuals in that data set or the don't drink so ,uch tea.

whoop whoop, it still makes fun :)

whoop whoop, it still makes fun :)


Analysis of longitudinal data

Analysis of Longitudinal Data I: GraphicalDisplays and Summary Measure Approach

using the RATS data

library(dplyr)
library(tidyr)
RATSL <- read.csv("C:/Users/susac/Documents/R/IODS-project/data/RATSL.csv", sep = ",", header = TRUE, row.names = 1)
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
dim(RATSL)
## [1] 176   5
RATSL$ID <- factor(RATSL$ID) #factor again
RATSL$Group <- factor(RATSL$Group)
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...

The Rats data set encompasses the weight of the rats, the weekdays, the ID number and the group number. WE have 3 groups of rats getting a different diet.

visualization of the data

## <ScaleContinuousPosition>
##  Range:  
##  Limits:  225 --  628

The first graph shows that the rats in group 1 compared to group 2 and 3 have a lower start weight. In group 2 we have rat 12 which seems to be an outlier because its weight more than the other rats. Also in group 1 (rat 2) and group 3 (rat 13) have a different weight compared to the other rats within the group. In this graph it seems that in group 1 all mice gain weight, as well in group 2 and 3.

visualization of standardized data

## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...

The standardized plot shows now, that in group 1 the overall weight is stable, of course within subject differences. In group 2 we have a weight gain except 1 rat, which is loosing weight. In group 3 it is 50:50 of weight gain.

visualization of a summary

n <- RATSL$Time %>% unique() %>% length()
RATSS <- RATSL %>%
  group_by(Time, Group) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Time' (override with `.groups` argument)
ggplot(RATSS, aes(x = Time, y = mean, shape = Group, linetype= Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.9,0.5)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

This plot shows the mean weight gain of each group against the time. Also here the graphs shows that the start weight is different between the groups. The weight gain of group 1 seems to be the slowest on, whereas group 2 has the fastest weight gain. Group 3 is between the other both groups regarding weight gain.

Applying the summary approach

library(dplyr)
library(tidyr)
RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup() 
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSL8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 2...
# visualization with outliers
ggplot(RATSL8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "green") +
  scale_y_continuous(name = "mean(weight), days 1-64")

#visualization without 1 big outlier for group 2

RATSL8S1 <- RATSL8S %>%
  filter(ID !=2& ID !=12 & ID !=13)

ggplot(RATSL8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y  = "mean", geom = "point", shape=23, size=4, fill = "blue") +
  scale_y_continuous(name = "mean(weight), days 1-64 without outlier")
## Warning: `fun.y` is deprecated. Use `fun` instead.

This boxplot shows the overall weight of each group. We can see that all groups have an outlier. Especially group 2 has a huge one. (graph 2 shows the group without outlier in group 2). As already assumed, all three groups have a different total weight with the highest weight in group 3.

t-Test

Because we have 3 groups we cannot use t-test, because for a t-test it has to be exactly 2 groups. Therefore, we use the anova test.

Variance analysis

# only used the long wide data, and I don't know yet how to convert them in wide form again, I sued the original data again

RATS<-read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", sep="", header=TRUE)
RATS$ID <- factor(RATS$ID) #factor again
RATS$Group <- factor(RATS$Group)

glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4...
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 4...
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 4...
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 4...
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 4...
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 4...
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 4...
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 5...
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 5...
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 5...
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 5...
RATSL8S2 <- RATSL8S %>%
  mutate(baseline = RATS$WD1)
fit <- lm(mean ~ baseline + Group, data = RATSL8S2) # using the linear model because ANOVA is a type of that

anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova has some assumptions like linearity between covariante and outcome variable, homogeneity of regression slope, outcome variable should be normal distributed, homoscedasticity and no significant outliers. The outliers we already removed in the step above. The other assumptions we assume they are right. The output shows that the baseline measurements are strongly related to the weight values taken after the weekdays (p<0.001). However it seems that the diet made not big difference between the groups after 64WDs. The p-values are 0.1.

Analysis of Longitudinal Data II: LinearMixed Effects Models for Normal ResponseVariables

using the BPRS data

library(dplyr)
library(tidyr)
BPRSL <- read.csv("C:/Users/susac/Documents/R/IODS-project/data/BPRSL.csv", sep = ",", header = TRUE, row.names = 1)
glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
dim(BPRSL)
## [1] 360   5
BPRSL$treatment <- factor(BPRSL$treatment) # factor again
BPRSL$subject <- factor(BPRSL$subject)

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

These data set shows the bprs points of 40 men of 8 weeks. these 40 men were split randomly in 2 groups (treatment 1 and 2) at the beginning of the study.

visualization

ggplot(BPRSL, aes(x = week, y = bprs, color = subject)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ treatment, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))

We can see that almost all bprs values decrease over the 8 weeks and that higher bprs scores at the beginning have usually also higher scores at the end.

Linear Regression Model

BPRS_reg<-lm(bprs~week+treatment, BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

In this model we ignore the repeated measures structure. Also we assume that the repeated measures are independent. The output of the linear regression model shows that averaged the bprs scores decres around 2 points per week. This results is significant with a p-value of <0.001. It seems that the different treatment have no significant influence on the bprs scores (p-vale of treatment 2 =0.661).

Random Intercept Model

library(lme4)
## Warning: package 'lme4' was built under R version 4.0.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

The code looks just like what we used for regression with lm, but with an additional component specifying the group, the subject effect. This term (1 | subject) means that the intercept (represented by 1), to vary by subject. As we can see, the vareiance of the mean for the intercept is -47.41 with a standard deviation of 6.88. The t-value (0.5) is very close to 0 which indicates that this treatment 2 has no statistical significance to treatment 2.

Random Intercept and Random Slope Model

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this model we add now another random effect, the random slpoe which is in our example the weeks (week | subject). This model allawos the indivifual differences in subjects. As you can directly see at the beginning the AIC value is slighty lower, this means that this model fits a bit better than the model above. The week has a negative correaltion of -0.51. This means that the subjects reduce the bprs points every week. However, there seems also no big differences in this model compared to the model above. It is a similar dvelopment within the weeks. Anova shows with a significane of 0.05 that this model is better than the model without the random slope as the AIC value already indicated.

Random Intercept and Random Slope Model with interaction

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model tests the interaction of week and treatment group (week*treatment). This model does not better fit than the other 2 because the AIC has the highest value. Therfore, all in all, we can say the model with random intercept and slope is the best fit model. This model is now plotted. As you can see the fitted model shows no fitting of the actual observations. But the tendency of the bprs score of each subject is the same as in the actual observations. There seems to be also no differences between the 2 treatments

Amazing, how short 6 weeks can be

Amazing, how short 6 weeks can be